With the advancement of single-cell technologies, we can now profile multiple modalities in the same cell. There are many examples of such approaches:
One of the most popular methods is the Epi Multiome ATAC + Gene Expression from 10x Genomics. You can find the workflow here.
In this tutorial, we will use single-cell RNA-ATAC multiomic data
from 18-day-old brain organoids, as described in the study Inferring
and perturbing cell fate regulomes in human brain organoids. We will
analyze the scMultiome data in R using Seurat and Signac
Optional: You can download the raw sequencing data
(FASTQ files) from E-MTAB-12002.
For this tutorial, we re-generated transcript count and peak
accessibility matrices using cellranger-arc count
(v2.0.2) with refdata-cellranger-arc-GRCh38-2020-A-2.0.0.
Note: we typically run cellranger pipelines on HPC clusters.
Begin by loading the necessary libraries:
library(EnsDb.Hsapiens.v86) # Annotation database
library(biovizBase)
library(Seurat)
library(Signac)
library(Matrix)
library(dplyr)
library(tidyr)
library(ggplot2)
library(presto)
library(pheatmap)
library(yaml)
library(tibble)
library(clustree)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggrepel)
library(DESeq2)
Note: Both Seurat and
Signac are under active development. To ensure
compatibility, verify your package versions:
packageVersion("Seurat")
[1] '5.2.1'
Before we dive into the actual analysis, it is always good to examine the dataset. You can check the detailed sample information here, or from the downloaded files:
# Summarize sample information by genetic modification and modality
sample_info %>%
mutate(
Modality = case_when(
grepl("ATAC", Assay.Name) ~ "ATAC",
grepl("RNA", Assay.Name) ~ "RNA",
TRUE ~ "Other"
)
) %>%
group_by(Characteristics.genetic.modification.) %>%
summarize(
ATAC_samples = paste(unique(Assay.Name[Modality == "ATAC"]), collapse = ", "),
RNA_samples = paste(unique(Assay.Name[Modality == "RNA"]), collapse = ", ")
)
In this dataset, S1 and S2 denote the same
biological sample sequenced across two different lanes. There are three
paired RNA+ATAC datasets: two CRISPR knockouts and one control. Each
pair corresponds to a specific sample ID:
PS: GLI3 is zinc finger protein which is involved in Sonic hedgehog (Shh) signaling.
To streamline the analysis, multiple runs (i.e., results from
cellranger-arc count) were aggregated using cellranger-arc aggr.
The aggregation was guided by a libraries_aggr.csv
file:
libraries <- read.csv('data/libraries_aggr.csv')
libraries
Key points:
aggr combines count matrices (not an
integration step).aggr merges all fragments and performs
new peak calling, ensuring a unified peak set (check here).Note: Alternatively, peak calling for each sample
can be performed using tools like MACS3
or the Signac::CallPeaks.
Subsequently, the union/intersection of peaks across samples can be
selected for downstream analysis.
Similar to the standard pipeline for single-cell RNA-seq (scRNA-seq)
data, we can use the output folder filtered_feature_bc_matrix.
count_aggr <- Read10X("data/filtered_feature_bc_matrix/")
Since this dataset includes two modalities (RNA & ATAC), the output will contain two matrices. We can check their dimensions as follows:
dim(count_aggr$`Gene Expression`)
[1] 36601 25327
dim(count_aggr$Peaks)
[1] 301573 25327
Before proceeding, we confirm whether the barcodes (cell identities) in both matrices match:
all(colnames(count_aggr$`Gene Expression`) == colnames(count_aggr$Peaks))
[1] TRUE
If TRUE, the barcodes are perfectly aligned across both modalities. If FALSE, further investigation is needed to identify discrepancies.
The RNA assay generates a GENE × CELL matrix, which already provides useful information:
rownames(count_aggr$`Gene Expression`) %>% head()
[1] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" "AL627309.3" "AL627309.2"
seurat <- CreateSeuratObject(counts = count_aggr$`Gene Expression`,
assay = "RNA")
seurat
An object of class Seurat
36601 features across 25327 samples within 1 assay
Active assay: RNA (36601 features, 0 variable features)
1 layer present: counts
Unlike RNA data, chromatin accessibility data is stored as a PEAK × CELL matrix:
rownames(count_aggr$Peaks) %>% head()
[1] "chr1:9795-10680" "chr1:180701-181102" "chr1:181191-181694" "chr1:267570-268463"
[5] "chr1:585750-586644" "chr1:629483-630394"
Each feature (row) represents a genomic region (peak) in the format:
chromosome:start-end. And we will notice that the matrix is
still very big. To not kill this R session because of memory
requirements, we can do some filtering first:
min_cells <- round(0.05 * ncol(count_aggr$Peaks)) # Keep peaks present in at least X% of cells
filtered_peaks <- count_aggr$Peaks[rowSums(count_aggr$Peaks > 0) >= min_cells, ]
min_reads <- 100 # Keep peaks with at least Y reads
filtered_peaks <- filtered_peaks[rowSums(filtered_peaks) >= min_reads, ]
dim(filtered_peaks) # Check new dimensions
[1] 55566 25327
To better interpret the peaks, we can annotate them with the
nearest genes. Many genomic databases and tools provide
R interfaces, making R a powerful environment for genomic data analysis.
Here we follow the Seurat
WNN tutorial:
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
################ NOTE #################
## StringToGRanges keep crushing (?) ###
## So we do it manually ##
########################################
# grange.counts <- StringToGRanges(rownames(filtered_peaks), sep = c(":", "-"))
# rownames(filtered_peaks) contains chromosome coordinates in 'chr:start-end' format
peak_names <- rownames(filtered_peaks) # Example: c("chr1:1-10", "chr2:12-3121")
# Split based on ":" first (to separate 'chr' from 'start-end')
split_peaks <- strsplit(peak_names, ":")
# Extract chromosome names
chr <- sapply(split_peaks, `[`, 1)
# Further split the second part (start-end) by "-"
ranges <- sapply(split_peaks, `[`, 2)
ranges_split <- strsplit(ranges, "-")
# Extract start and end positions as integers
start <- as.integer(sapply(ranges_split, `[`, 1))
end <- as.integer(sapply(ranges_split, `[`, 2))
# Create GRanges object
grange.counts <- GRanges(seqnames = chr, ranges = IRanges(start = start, end = end))
# We'll only use peaks in standard chromosomes
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
filtered_peaks <- filtered_peaks[as.vector(grange.use), ]
We can check the filtered matrix dimension:
dim(filtered_peaks)
[1] 55548 25327
Important: There are other ways of how we could
reduce the number of peaks. For example, as mentioned
above, if you are using cellranger-arc count outputs
directly, you can manually curate a list of peaks that are present in
multiple samples (check here).
For large datasets, consider leveraging HPC clusters with higher memory
capacities.
frag.file <- "data/atac_fragments.tsv.gz"
# Finally create the ATAC assay
seurat[['ATAC']] <- CreateChromatinAssay(
counts = filtered_peaks,
ranges = grange.counts[grange.use],
genome = 'hg38',
fragments = frag.file,
annotation = annotations
)
Now we have the seurat object with both RNA and ATAC assays:
seurat
An object of class Seurat
92149 features across 25327 samples within 2 assays
Active assay: RNA (36601 features, 0 variable features)
1 layer present: counts
1 other assay present: ATAC
Optional: Save the object.
saveRDS(
object = seurat,
file = "out/raw_seurat.Rds"
)
This section is largely inspired from this tutorial.
# if you start from the prepared seurat object
seurat <- readRDS('out/raw_seurat.Rds')
For the RNA assay, what we would check is the same as the typical scRNA-seq (i.e., filter out cells with too few or too many detected genes or transcripts (UMIs)).
seurat <- PercentageFeatureSet(seurat, pattern = "^MT-", col.name = "percent.mt", assay = "RNA")
For the ATAC assay, we would also look at the number of detected peaks or detected fragments, similar to the RNA assay. On top of that, we would also exclude cells with too weak fragment enrichment around the transcriptional start sites (TSS). We can also quantify the approximate ratio of ATAC fragments with a strong nucleosome banding pattern (those with fragment lengths around a single nucleosome) that unlikely represent real accessible genomic regions, in relative to the nucleosome-free fragments, and then discard cells with too high of such ratio.
seurat <- NucleosomeSignal(seurat, assay = "ATAC")
For TSS enrichment, we do some tricks here to reduce the computing time:
# Extract TSS annotation
tss <- GetTSSPositions(Annotation(seurat@assays$ATAC))
seurat <- TSSEnrichment(seurat,tss.positions=sample(tss,1000),assay = "ATAC")
After aggregation, barcodes are typically appended with a suffix indicating their origin (e.g., -1, -2, -3). You can use these suffixes to assign sample identities:
# Extract sample identifiers from cell barcodes
seurat$sample <- sapply(strsplit(colnames(seurat), "-"), `[`, 2)
seurat$sample <- plyr::mapvalues(
seurat$sample,
from = c("1", "2", "3"),
to = c("WT", "KO1", "KO2")
)
We can have a look at the QC results for each sample here:
VlnPlot(seurat,
features = c("nFeature_RNA",
"percent.mt",
"nFeature_ATAC",
"TSS.enrichment",
"nucleosome_signal"),
ncol = 3, group.by='sample',
pt.size = 0)
Based on the distributions, we can set the filtering criteria as:
seurat <- subset(seurat,
subset = nFeature_RNA > 1000 &
nFeature_RNA < 7500 &
percent.mt < 40 &
nFeature_ATAC > 1000 &
nFeature_ATAC < 25000 &
TSS.enrichment > 1 &
nucleosome_signal < 1
)
The analysis for the scRNA assay is rather standard, including data
normalization, highly variable genes identification, data scaling,
principal component analysis (PCA), and then the UMAP embedding.
Additionally, we use the CCA integration from seurat
following this tutorial:
DefaultAssay(seurat) <- "RNA"
# to make it work with seurat v5
seurat[['RNA']] <- split(seurat[["RNA"]], f = seurat$sample)
seurat <- NormalizeData(seurat) %>%
FindVariableFeatures(nfeatures = 3000) %>%
ScaleData() %>%
RunPCA(npcs = 50)
seurat <- IntegrateLayers(object = seurat, method = CCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.cca",
verbose = FALSE)
seurat[['RNA']] <- JoinLayers(seurat[['RNA']])
seurat <- RunUMAP(seurat, dims = 1:30, reduction = "integrated.cca",
reduction.name = "umap_rna", reduction.key = "UMAPINTEGRATEDCCARNA_")
For visualization, we use the marker genes mentioned in this tutorial.
# Visualization
marker_genes <- c("MKI67","NES","DCX","FOXG1","DLX2","EMX1","OTX2","LHX9","TFAP2A")
p1 <- DimPlot(seurat, group.by = "sample", reduction = "umap_rna") & NoAxes()
p2 <- FeaturePlot(seurat,
marker_genes,
reduction = "umap_rna") & NoAxes() & NoLegend()
p1 | p2
We can already see that the 2 KOs and WT are nicely integrated. From maker genes we can also see the separation of progenitors (e.g. NES) and neurons (e.g. DCX), as well as brain regional identities (e.g. FOXG1). We can also do clustering analysis based on RNAseq data alone:
seurat <- FindNeighbors(seurat, reduction = "integrated.cca", dims = 1:30)
We can use clustree to build a clustering tree, which is
useful to visualize the relationships between clustering
resolutions.
# Make Plot
clustree(seurat, layout="sugiyama", prefix = "RNA_snn_res.")
NOTE: There are indeed many other criteria and algorithms to evalute the number of clusters. Here to make things simple, we just choose one resolution where the clusters are relatively stable.
p1 <- DimPlot(seurat, group.by = "RNA_snn_res.0.7",
reduction = "umap_rna", label = T) & NoAxes() & NoLegend()
p1 | p2
ATAC-seq data analysis follows a similar workflow to scRNA-seq, including normalization, feature selection, linear and non-linear dimensional reduction, and data integration. However, due to differences in data properties, the specific algorithms used in each step vary.
This section primarily follows the official
tutorial from Signac together with this multiome
analysis tutorial.
Feature selection:The low dynamic range of scATAC-seq data makes it challenging to perform feature selection based on variablity as in scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less than n cells (similar to what we did when creating this seurat object).
DefaultAssay(seurat) <- "ATAC"
# select the top 25% most common features for faster runtime
seurat <- FindTopFeatures(seurat, min.cutoff = 'q75')
Normalization: Signac performs Term Frequency-Inverse Document Frequency (TF-IDF) normalization. This is a two-step normalization procedure:
seurat <- RunTFIDF(seurat)
Linear dimension reduction: We next run Singular Value Decomposition (SVD) to the TD-IDF matrix, which gives Latent Semantic Indexing (LSI) components. This is very similar to PCA, but it can better handle the sparsity (as we can see ATAC matrix is much bigger than the RNA one). The first singular vector often captures sequencing depth (technical variation) rather than biological variation.
seurat <- RunSVD(seurat, n = 50)
# check what the LSI dimensions explain
p1 <- ElbowPlot(seurat, ndims = 30, reduction="lsi")
p2 <- DepthCor(seurat, n = 30) # correlation to sequencing depth
p1 | p2
Non-linear dimension reduction: Similar to scRNA-seq data, we can visualize the data using UMAP:
seurat <- RunUMAP(seurat,
reduction = "lsi",
dims = 2:30,
reduction.name = "umap_atac",
reduction.key = "UMAPATAC_")
p1 <- DimPlot(seurat,
group.by = "sample",
reduction = "umap_atac") & NoAxes()
p2 <- FeaturePlot(seurat,
marker_genes,
reduction = "umap_atac") & NoAxes() & NoLegend()
p1 | p2
Here, both panels are shown in the UMAP space derived from ATAC data. On the left, we observe a relatively uniform “blob-like” structure comparing to the UMAP space from RNA assay. However, when RNA expression from the multiome is overlaid on the same ATAC UMAP (right), we can still see a separation between progenitors and neurons (e.g., MKI67, NES, DCX), as well as evidence of brain regionalization (e.g., FOXG1, DLX2, EMX1, OTX2, LHX2, TFAP2A), indicating that RNA remains the dominant source of information for cell identity and state.
To estimate gene activity, we assess chromatin accessibility
associated with each gene and generate an inferred RNA assay from the
scATAC-seq data. This can be done using a wrapper function
GeneActivity():
gene.activities <- GeneActivity(seurat)
seurat[['RNA_inferred']] <- CreateAssayObject(gene.activities) %>% NormalizeData()
DefaultAssay(seurat) <- "RNA_inferred"
p3 <- FeaturePlot(seurat,
marker_genes,
reduction = "umap_atac") & NoAxes() & NoLegend()
p1 | p3
Here, we show gene activity scores inferred from ATAC profiles. While the activity patterns for some genes, such as FOXG1, partially recapitulate the expression patterns observed in the RNA assay (previous figure), the overall separation between cell types is less distinct. This highlights that while gene activity from ATAC can provide useful proxies for transcriptional programs, it lacks the resolution and specificity of direct RNA measurements.
Integration of ATAC dataset is very similar to RNA, but we will use
the LSI components. However, currently Seurat v5 is not
compatible with ChromatinAssay, so we will use the old(er)
approach. This section largely follows the scATAC-seq
data integration tutorial.
DefaultAssay(seurat) <- 'ATAC'
integration.anchors <- FindIntegrationAnchors(
object.list = SplitObject(seurat, "sample"),
reduction = "rlsi", # we use Reciprocal LSI
dims = 2:30 # ignore the first LSI components
)
# Integrate LSI embeddings
integrated <- IntegrateEmbeddings(
anchorset = integration.anchors,
reductions = seurat[["lsi"]],
new.reduction.name = "integrated.lsi",
dims.to.integrate = 1:30
)
# Add the dimension reduction to the original seurat
seurat[['integrated.lsi.atac']] <- CreateDimReducObject(
Embeddings(integrated, "integrated.lsi")[colnames(seurat),],
key="INTEGRATEDLSIATAC_", assay="ATAC"
)
seurat <- RunUMAP(seurat,
reduction = "integrated.lsi.atac",
dims = 2:30,
reduction.name = "umap_atac",
reduction.key = "UMAPINTEGRATEDLSIATAC_")
Finally, we can visualize our cells in the integrated ATAC space:
DefaultAssay(seurat) <- 'RNA'
p1 <- DimPlot(seurat,
group.by = "sample",
reduction = "umap_atac") & NoAxes()
p2 <- FeaturePlot(seurat,
marker_genes,
reduction = "umap_atac") & NoAxes() & NoLegend()
p1 | p2
Optional: Save the object.
saveRDS(seurat, 'out/unimodal_seurat.Rds')
# if you start from separately integrated data
seurat <- readRDS('out/unimodal_seurat.Rds')
RNA and ATAC assays capture different aspects of cell identity and regulation. As discussed earlier, RNA typically remains the dominant source of information for defining cell identity and state. We can visualize this by comparing the RNA and ATAC embeddings side by side.
DefaultAssay(seurat) <- "RNA"
Reduce("|", lapply(c("umap_rna",
"umap_atac"), function(dr){
p1 <- DimPlot(seurat,
group.by = "RNA_snn_res.0.7",
reduction = dr) & NoLegend()
p2 <- FeaturePlot(seurat,
marker_genes,
reduction = dr) & NoAxes() & NoLegend()
p1 / p2
}))
This side-by-side visualization helps illustrate how each modality captures structure in the data. The RNA-based embedding often shows clearer separation of cell types, while the ATAC embedding may capture regulatory variation not visible in RNA alone.
So far, we have treated the two modalities separately, given their distinct data characteristics. A common strategy for integrating ATAC-seq with RNA-seq data is to infer gene activity scores (e.g., Gene.Activity) from chromatin accessibility profiles.
However, since we are working with multiome data—where both ATAC and RNA are measured from the same cell—we can go a step further. Weighted Nearest Neighbor (WNN) analysis provides an unsupervised framework that learns the relative contribution of each modality per cell. This allows us to perform a more integrated analysis that balances both transcriptomic and chromatin information.
This part mainly follows the Weighted
Nearest Neighbor Analysis tutorial from Seurat.
DefaultAssay(seurat) <- 'RNA'
seurat <- FindMultiModalNeighbors(seurat,
reduction.list = list("integrated.cca", "integrated.lsi.atac"),
dims.list = list(1:50, 2:30))
seurat <- RunUMAP(seurat, nn.name = "weighted.nn",
reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
# Here we just choose one resolution, but feel free to change this parameter
seurat <- FindClusters(seurat, graph.name = "wsnn", resolution = 0.7, verbose = FALSE)
p1 <- DimPlot(seurat,group.by = "sample",reduction = "wnn.umap")
p2 <- DimPlot(seurat,group.by = "wsnn_res.0.7",label=T, reduction = "wnn.umap") & NoLegend()
p3 <- FeaturePlot(seurat,
marker_genes,
reduction = "wnn.umap") & NoAxes() & NoLegend()
p1 / p2 | p3
This visualization highlights how the WNN approach brings together complementary signals from RNA and ATAC to refine cell clustering and interpretation.
This part we try to answer this philosophical question: Why do we use multiome? Is it just because we can (afford to) do it?
To begin answering this, let’s compare clustering results:
# RNA-only clustering
p1 <- DimPlot(seurat, group.by = "RNA_snn_res.0.7", reduction = "umap_rna", label = T) +
ggtitle("RNA-only Clustering")
# WNN clustering
p2 <- DimPlot(seurat, group.by = "wsnn_res.0.7", reduction = "wnn.umap", label = T) +
ggtitle("WNN Clustering")
p1 & NoLegend()| p2 & NoLegend() # Combine with patchwork
At first glance, the two clustering results look quite similar. That’s not surprising—RNA often carries the strongest signal for cell identity, especially for broad cell types.
However, when using the same clustering resolution, adding the ATAC modality via WNN can refine local structures and help separate subpopulations that RNA alone might miss.
Let’s quantify the similarity between RNA-only and WNN clusters using a contingency table:
table(RNA=seurat$RNA_snn_res.0.7, WNN=seurat$wsnn_res.0.7) %>%
heatmap(xlab='WNN clustering', ylab='RNA-only Clustering', Rowv = NA, Colv = NA)
In some cases, WNN might uncover subtle transitions or regulatory shifts that are invisible in transcriptomic space alone.
In this step, we use the WNN clustering results to annotate cell types based on marker gene expression.
DefaultAssay(seurat) <- "RNA"
DE_wnn <- wilcoxauc(seurat, "wsnn_res.0.7", seurat_assay = "RNA")
top_markers_wnn <- DE_wnn %>%
filter(abs(logFC) > log(1.2) &
padj < 0.01 &
auc > 0.65 &
pct_in - pct_out > 30 &
pct_out < 20) %>%
group_by(group) %>%
top_n(10, wt = auc)
top_markers_wnn
From here we can already identify some markers of very early development stages (i.e POU5F1,POU3F1 and RFX4).
Reference Markers from Brain Organoid Data
We provide a curated YAML file from a published brain organoid study, which includes useful marker genes for annotation:
yaml_data <- read_yaml('data/marker_genes.yaml')
df <- as.data.frame(do.call(rbind, lapply(yaml_data, as.list)))
DotPlot(seurat, df$marker_genes, group.by = 'wsnn_res.0.7') + RotatedAxis()
To explore more specific subtypes (e.g., neural progenitor cell (NPC) subtypes):
npc_list <- as.data.frame(do.call(rbind,
lapply(yaml_data$neural_progenitor_cell$subtypes, as.list)))
all_genes <- unlist(npc_list$marker_genes)
dups <- unique(all_genes[duplicated(all_genes)])
npc_markers <- lapply(npc_list$marker_genes, function(genes) {
setdiff(genes, dups)
})
DotPlot(seurat, npc_markers, group.by = 'wsnn_res.0.7') + RotatedAxis()
Alternative: Use Average Cluster Expression
In case you think DotPlot is very fabricated, you can
compute average expression across clusters and visualize known markers
in a heatmap:
avg <- AggregateExpression(seurat, assays = 'RNA', group.by='wsnn_res.0.7',
return.seurat=TRUE)
avg <- NormalizeData(avg)
avg <- GetAssayData(object = avg, assay = "RNA")
marker_df <- read.table('data/markers.txt', sep='\t')
pheatmap(avg[unique(rownames(marker_df)), ],
color = colorRampPalette(c("navy","white", "red"))(20),
breaks = seq(0,2,0.1),
scale = "column", # scaling can change the visual effects
cluster_cols = FALSE, cluster_rows = FALSE,
annotation_row = marker_df %>% select(level1, level2, level3))
Assigning Cell Type Annotations
Using a combination of known markers and clustering structure, we can now assign broad annotations to WNN clusters:
meta <- seurat@meta.data %>%
mutate(stage=case_when(
wsnn_res.0.7 %in% c(0, 14, 3, 9) ~ 'telencephalon',
wsnn_res.0.7 %in% c(1, 2, 5, 6, 8, 13) ~ 'early',
wsnn_res.0.7 %in% c(4, 7, 11, 15, 19) ~ 'nontelencephalon',
wsnn_res.0.7 %in% c(10, 12, 17, 18) ~ 'other',
wsnn_res.0.7 %in% c(16) ~ 'neuron'
)) %>%
mutate(celltype=interaction(stage, wsnn_res.0.7, drop = TRUE))
seurat <- AddMetaData(seurat, meta)
p1 <- DimPlot(seurat,
group.by = "stage",
reduction = "wnn.umap", label = T) & NoAxes() & NoLegend()
p2 <- FeaturePlot(seurat,
marker_genes,
reduction = "wnn.umap") & NoAxes() & NoLegend()
p1 | p2
We can see that some cells appear ambiguous and are blended into the
early developmental stage. If needed, we could further refine this
annotation by running FindSubCluster() to split specific
clusters. But for now, we’ll keep it simple and proceed with this level
of resolution.
Optional: Save the object:
saveRDS(seurat, 'out/bimodal_seurat.Rds')
# If you start from part 5 directly
seurat <- readRDS('out/bimodal_seurat.Rds')
In the context of a CRISPR knockout (KO) experiment, our main goal is to understand how gene expression changes in KO versus control (WT) cells. For simplicity, we’ll focus on telencephalon progenitors.
Start by checking how the samples are distributed across cell stages:
# Have a look
table(seurat$sample, seurat$stage)
We’ll subset only the telencephalon cells:
seurat_sub <- subset(seurat, stage == 'telencephalon')
To keep things simple, we combine KO1 and KO2 into a single group and compare them against WT:
seurat_sub$group <- ifelse(seurat_sub$sample %in% c("KO1", "KO2"), "KO", "WT")
res_ko_vs_wt <- wilcoxauc(seurat_sub, 'group', seurat_assay = 'RNA')
res_ko_vs_wt_only <- res_ko_vs_wt[res_ko_vs_wt$group == "KO", ]
res_ko_vs_wt_only %>%
filter(abs(logFC) > log(1.2) &
padj < 0.05) %>%
group_by(group) %>%
top_n(10, wt = auc)
NOTE: If you want to compare KO1 vs WT and KO2 vs WT
individually, subset the data before running wilcoxauc()
since it compares each group to all others. Alternatively, you can use
FindMarkers() with ident.1 and
ident.2.
Pseudo-bulk DE Analysis with DESeq2
The best practices for DEG testing in single-cell data are still evolving. One strategy to reduce false positives is to perform pseudo-bulk DE by aggregating counts at the sample level.
pseudo_bulk <- AggregateExpression(seurat_sub, assays = 'RNA', group.by='sample')$RNA
samples <- colnames(pseudo_bulk)
condition <- ifelse(samples %in% c("KO1", "KO2"), "KO", "WT")
sample_info <- data.frame(
sample = samples,
condition = condition
)
dds <- DESeqDataSetFromMatrix(countData = pseudo_bulk,
colData = sample_info,
design = ~ condition)
dds <- DESeq(dds)
# extract the results of KOs vs WT
res <- results(dds, contrast = c("condition", "KO", "WT"))
res_df <- as.data.frame(res)
res_df <- res_df[order(res_df$padj), ]
head(res_df, n=10)
Volcano Plots: Single-cell vs Pseudo-bulk DE
## Plot single cell
res_ko_vs_wt_only$log2FC = res_ko_vs_wt_only$logFC/log(2)
res_ko_vs_wt_only$significance <- ifelse(res_ko_vs_wt_only$padj < 0.05 & abs(res_ko_vs_wt_only$log2FC) > 0.2, "Significant", "Not significant")
# Select top 10 most significant genes
top_genes <- head(res_ko_vs_wt_only[order(res_ko_vs_wt_only$padj), "feature"], 10)
top_genes <- c(top_genes, 'EMX2', 'GLI3','PAX6','SOX4','SOX11')
res_ko_vs_wt_only$label <- ifelse(res_ko_vs_wt_only$feature %in% top_genes, res_ko_vs_wt_only$feature, NA)
p1 <- ggplot(res_ko_vs_wt_only, aes(x = log2FC, y = -log10(padj), color = significance)) +
geom_point(alpha = 0.8) +
geom_text_repel(aes(label = label), max.overlaps = 50) +
scale_color_manual(values = c("grey", "red")) +
theme_minimal() +
labs(title = "Volcano Plot (Single-cell DE)", x = "log2 Fold Change", y = "-log10(adjusted p-value)")
## Plot minibulk
res_df$gene <- rownames(res_df)
res_df$significance <- ifelse(res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 0.2, "Significant", "Not significant")
top_genes <- head(res_df[order(res_df$padj), "gene"], 10)
top_genes <- c(top_genes, 'EMX2', 'GLI3','PAX6','SOX4','SOX11')
res_df$label <- ifelse(res_df$gene %in% top_genes, res_df$gene, NA)
p2 <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = significance)) +
geom_point(alpha = 0.8) +
geom_text_repel(aes(label = label), max.overlaps = 50) +
scale_color_manual(values = c("grey", "blue")) +
theme_minimal() +
labs(title = "Volcano Plot (Pseudo-bulk DE)", x = "log2 Fold Change", y = "-log10(adjusted p-value)")
p1 & NoLegend()| p2 & NoLegend()
So far, we’ve been treating the ATAC assay as just another data layer—but what additional biological insights can it provide?
Chromatin accessibility reflects the openness of genomic regions. The more “open” a region is, the more likely it is accessible to transcription factors and other regulatory proteins. This accessibility often correlates with regulatory activity.
The power of multiome data is that we can measure both chromatin accessibility and gene expression in the same cells, enabling direct insights into gene regulation.
In this step, we’ll: 1. Use RegionStats() to compute
base composition information for peaks. 2. Use LinkPeaks()
to associate peaks with nearby genes based on correlation between
accessibility and expression.
This section is mostly inspired from this section of tutorial.
library(BSgenome.Hsapiens.UCSC.hg38)
seurat_sub <- RegionStats(seurat_sub,assay = 'ATAC',
genome = BSgenome.Hsapiens.UCSC.hg38)
seurat_sub <- LinkPeaks(seurat_sub,
peak.assay = "ATAC",
expression.assay = "RNA",
genes.use = top_genes)
You can visualize peak-gene links with CoveragePlot() and explore specific regulatory relationships using browser-style views.
DefaultAssay(seurat_sub) <- "ATAC"
p1 <- CoveragePlot(seurat_sub,
region = "HES4",
features = "HES4",
group.by = "sample",
extend.upstream = 5000,
extend.downstream = 5000)
p2 <- CoveragePlot(seurat_sub,
region = "PAX2",
features = "PAX2",
group.by = "sample",
extend.upstream = 5000,
extend.downstream = 5000)
patchwork::wrap_plots(p1, p2, ncol = 1)
Following the identification of differentially accessible chromatin regions in Step 2, we now want to ask: Which transcription factors (TFs) might be driving these changes in chromatin accessibility?
This is commonly done by analyzing enrichment of TF binding motifs in the accessible regions.
We’ll use the JASPAR2020 database of vertebrate motifs and annotate peaks with known TF motifs using AddMotifs().
library(TFBSTools)
library(JASPAR2020)
pfm <- getMatrixSet(
x = JASPAR2020,
opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE)
)
df_pfm <- data.frame(t(sapply(pfm, function(x)
c(id=x@ID, name=x@name, symbol=ifelse(!is.null(x@tags$symbol),x@tags$symbol,NA)))))
seurat_sub <- AddMotifs(seurat_sub, genome = BSgenome.Hsapiens.UCSC.hg38, pfm = pfm)
Similar to gene expression analysis, we now test for differentially accessible peaks using the ATAC assay.
res_ko_vs_wt_atac <- wilcoxauc(seurat_sub, 'group', seurat_assay = 'ATAC')
res_ko_vs_wt_atac <- res_ko_vs_wt_atac[res_ko_vs_wt_atac$group == "KO", ]
top_ko_atac <-
res_ko_vs_wt_atac %>%
filter(padj < 0.01 &
auc > 0.6)
top_ko_atac
You can visualize accessibility differences in specific regions using
CoveragePlot():
CoveragePlot(seurat_sub,
region = "chr11-12214953-12215858",
group.by = "sample",
extend.upstream = 5000,
extend.downstream = 5000)
Before running motif enrichment, we match background peaks based on GC content and other region statistics to control for biases.
atac_peaks <- AccessiblePeaks(seurat_sub)
peaks_matched <- MatchRegionStats(
meta.feature = seurat_sub[['ATAC']]@meta.features[atac_peaks, ], # Get the region stats, like GC% etc
query.feature = seurat_sub[['ATAC']]@meta.features[top_ko_atac$feature, ])
Now we can perform motif enrichment on the differentially accessible peaks:
motif_enrichment_ko <- FindMotifs(seurat_sub,
features = top_ko_atac$feature,
background = peaks_matched) %>%
mutate(symbol = setNames(ifelse(is.na(df_pfm$symbol),
df_pfm$name, df_pfm$symbol), df_pfm$id)[motif]) %>%
mutate(padj = p.adjust(pvalue, method="BH"))
top_ko_motif <- motif_enrichment_ko %>%
filter(pvalue < 0.01 & fold.enrichment > 3) ##NOTE: you can choose the filter criteria
top_ko_motif
We can visualize the motifs:
What does this tell us?
The enriched motifs in KO-specific accessible peaks may point to TFs that are more active or upstream regulators of the gene expression programs altered by the knockout.
Now that we’ve covered the core multimodal workflow—visualization, integration, annotation, differential analysis, and motif enrichment. Below are some ideas for additional downstream steps: